ASR for South Slavic Languages Developed in Almost Automated Way
Abstract
Slavic languages pose several specific challenges that need to be addressed in the design of an ASR system. Since we have already built an engine suited to highly inflected languages, we now focus on adapting it to new languages. Here we present an efficient way to adapt the system to all seven South Slavic languages, using methods and tools that exploit similarities between the languages, easily adjustable grapheme-to-phoneme (G2P) rules, and common phonetic subsets. We show that it is possible to build accurate language and acoustic models in an almost automated way, entirely from resources found on the web. The acoustic models (AMs) are trained via cross-lingual bootstrapping followed by lightly supervised retraining on public data, such as broadcast and parliament archives. Tests on a set of main broadcast news programs in each language show word error rates (WER) in the range of 16.8% to 21.5%; these figures also include errors caused by out-of-language (OOL) utterances, which often occur in this type of spoken program.
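The WER figures quoted in the abstract are conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The abstract does not say which scoring tool or text normalization was used, so the following is only a minimal sketch of the standard definition; the function name `word_error_rate` and the example sentences are hypothetical and not taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly;
    real scoring pipelines usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, an out-of-language utterance that the recognizer cannot transcribe counts against the score like any other error, which is consistent with the abstract's remark that the reported values include OOL errors.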
Publication date: 2016